Abstract

Analysis-by-synthesis has been a successful approach for many tasks in computer
vision, such as 6D pose estimation of an object in an RGB-D image, which is
the topic of this work. The idea is to compare the observation with the output of a
forward process, such as a rendered image of the object of interest in a particular
pose. Due to occlusion or complicated sensor noise, it can be difficult to perform
this comparison in a meaningful way. We propose an approach that “learns to compare”,
while taking these difficulties into account. This is done by describing the
posterior density of a particular object pose with a convolutional neural network
(CNN) that compares an observed and a rendered image. The network is trained
with the maximum likelihood paradigm. We observe empirically that the CNN
does not specialize to the geometry or appearance of specific objects, and it can
be used with objects of vastly different shapes and appearances, and in different
backgrounds. Compared to the state of the art, we demonstrate a significant improvement
on two different datasets which include a total of eleven objects, cluttered
background, and heavy occlusion.


1 Introduction 

Tremendous effort has focused on the tasks of object instance detection and pose
estimation in images and videos. In this paper we consider pose estimation in a single
RGB-D image, as shown in Fig. 1. Given the extra depth channel, it becomes feasible to
extract the full 6D pose (3D rotation and 3D translation) of object instances present in
the scene. Pose estimation has important applications in many areas, such as robotics
[21, 32], medical imaging [24], and augmented reality [12]. Recently, Brachmann
et al. [5] achieved state-of-the-art results by adapting the analysis-by-synthesis approach
to pose estimation in RGB-D images. They use a random forest [6] to obtain pixel-wise
dense predictions. Building upon the system of [5], we propose a novel method





Figure 1: Three pose estimation results from the occlusion dataset from [5] and [14]. 
Arrows indicate the positions of estimated and ground truth poses. The green silhouette 
indicates the ground truth pose, the blue silhouette corresponds to our estimated pose. 
Red indicates the pose estimate from [5]. The marker board served only for ground 
truth annotation. 


to learn to compare in the analysis-by-synthesis framework. We use a convolutional 
neural network (CNN) inside a probabilistic context to achieve this. 

Analysis-by-synthesis has been a successful approach for many tasks in computer
vision, such as object recognition [13], scene parsing [15], pose estimation and tracking
[9]. A forward synthesis model generates images from possible geometric interpretations
of the world, and then selects the interpretation that best agrees with the measured
visual evidence. In particular for pose estimation, the idea is to compare the observation
with the output of a forward process, such as a rendered image of the object of
interest in a particular pose. When attempting pose estimation in RGB-D images, the
comparison step of analysis-by-synthesis is nontrivial due to occlusion or complicated
sensor noise. There are, for example, areas with no depth measurements in Kinect images,
or with poor IR reflectance.








1.1 Contributions


• We achieve considerable improvements over state-of-the-art methods of pose 
estimation in RGB-D images with heavy occlusion. 

• To the best of our knowledge, this work is the first to utilize a convolutional 
neural network (CNN) as a probabilistic model to learn to compare rendered and 
observed images. 

• We observe that the CNN does not specialize to the geometry or appearance of 
specific objects, and it can be used with objects of vastly different shapes and 
appearances, and in different backgrounds. 

The paper is organized as follows. Section 2 provides an overview of related work. 
Our proposed approach is described in Sec. 3. In Sec. 4 we present an evaluation of our
method compared to the state of the art on two datasets. We conclude the paper in
Sec. 5. 

2 Related Work 

A large body of work in computer vision has focused on the problem of object detection 
and pose estimation, including instance and category recognition, rigid and articulated 
objects, and coarse (quantized) and accurate (6D) poses. Pose estimation has been 
an active topic, ranging from template-based approaches [14, 8], sparse feature-based 
approaches [21], and dense approaches [25, 5]. In the brief review below, we focus on 
techniques that specifically address CNNs and analysis-by-synthesis. 

CNNs have been driving advances in computer vision in recent years, in tasks such as image
classification [16], detection [31], recognition [2, 23], semantic segmentation [20], and pose
estimation [27]. CNNs have shown remarkable performance in the large-scale visual
recognition challenge (ILSVRC2012). The success of CNNs is attributed to their ability
to learn rich feature representations as opposed to hand-designed features used in
previous image classification methods. In [11], rich image and depth feature representations
have been learned with CNNs to detect objects in RGB-D images. In [1], CNNs
are used to generate an RGB image given the set of 3D chair models, the chair type,
viewpoint and color. Very recent work from Gupta et al. [10] uses object instance
segmentation output from [11] to infer 3D object pose in RGB-D images. Another CNN is
used to predict the coarse pose of the object. This CNN is trained using pixel normals in
images containing rendered synthetic objects. This coarse pose is used to align a small
number of prototypical models to the data, and place the model that fits the best into
the scene. Different from the above approaches, we use a CNN as a probabilistic model to
compare rendered and observed images. The output of our CNN is the energy value,
while in [10] the output of the CNN is the object pose. In [7], a similarity metric is
learned. The learning process minimizes a discriminative loss function. A CNN with
Siamese architecture is used for mapping two face feature spaces. Similarly, in [29]
Wohlhart and Lepetit train a CNN to map image patches to a descriptor space, where
pose estimation and object recognition are solved using the nearest neighbor method.
Our framework is probabilistic. The posterior distribution of the pose is modelled as
a Gibbs distribution with a CNN as energy function. Zbontar and LeCun [30] train a
CNN to predict how well two image patches match and use it to compute the stereo
matching cost. The cost is minimized by cross-based cost aggregation and semi-global
matching, followed by a left-right consistency check to eliminate errors in the occluded
regions. While in [30] the CNN is used for comparing two image patches, our CNN is
used to compare rendered and observed images.

Analysis-by-synthesis has been a successful approach for many tasks in computer
vision, such as object recognition [13], scene parsing [15], viewpoint synthesis [13],
material classification [28], and gaze estimation [26]. All these approaches use a forward
model to synthesize some form of image, which is compared to observations. Many
works learn a feature representation and compare in feature space. For instance, in [13]
the analysis-by-synthesis strategy has been used for recognizing and reconstructing
3D objects in images. The forward model synthesizes visual templates defined on
invariant features. Gall et al. [9] propose an analysis-by-synthesis framework for motion
capture and tracking. It combines patch-based and region-based matching to track body
parts. Patch-based matching extracts correspondences between two successive frames
for prediction and between the current image and a synthesized image for avoiding
drift. Recently, Brachmann et al. [5] achieved state-of-the-art results by adapting the
classical analysis-by-synthesis approach to 6D pose estimation of specific objects from a
single RGB-D image. They use a new representation in the form of a joint dense 3D
object coordinate and object class labeling. The major difference to our work is that we
learn to compare in the analysis-by-synthesis approach. For the problem of 6D pose
estimation, due to occlusion or complicated sensor noise, it can be difficult to compare
the observation with a rendered image of the object of interest in a particular
pose. In this paper, we propose an approach which draws on recent successes of
CNNs. Different from the aforementioned approaches, we model the posterior density of a
particular object pose with a CNN that compares an observed and a rendered image. The
network is trained with the maximum likelihood paradigm. One of the most closely
related works is [18]. They use a CNN as a part of a probabilistic model. The CNN is
fed in a sequential manner, first with the rendered image, then with the observed image.
This produces two feature vectors, which are compared in a subsequent step to give
the probability of the observed image. In contrast to [18], we jointly input the rendered
and observed images into a CNN to produce an energy value. The major difference is
that our CNN is trained, while they take a pre-trained CNN as feature extractor.

2.1 Review of the Pose Estimation Method [5] 

We will now describe the system from [5] in detail, because it is of particular relevance
for our method. Brachmann et al. [5] achieved state-of-the-art results by using a random
forest [6] to obtain pixel-wise dense predictions, which facilitate pose estimation.
Each tree in their forest is trained to jointly predict to which object a pixel belongs
and where it is located on the surface of this object. A tree outputs a soft segmentation
image for each object with values between 0 and 1, indicating whether a pixel
belongs to the object or not. The predictions of different trees are then combined to
a single object probability. Additionally, each tree outputs 3D object coordinates for
each object and each pixel. The term object coordinates refers to the coordinates in the
local coordinate system of the object. When estimating the pose of a particular object,
Brachmann et al. [5] utilize the forest predictions in two ways:

First, the predictions are used to define an energy function, which is minimized to obtain
the final pose. All aspects of the energy follow the analysis-by-synthesis principle. It is
based on a pixel-wise comparison between the predictions, the recorded depth values
and rendered images of the object in the particular pose. In detail, three comparisons
are done: (a) the rendered depth image of the object is compared to the recorded depth
image; (b) the rendered image of object coordinates is compared to the predicted
object coordinates; (c) the rendered segmentation mask of the object is compared to the
predicted object class probability for the object. The pixel-wise error inside the
segmentation mask is aggregated and divided by the area of the mask. Robust error measures
are used to deal with outliers.
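The aggregation described in the last two sentences can be sketched as follows. This is only an illustration: the truncated absolute difference stands in for the robust error measure, and the threshold `tau` and all function names are hypothetical rather than taken from [5].

```python
def robust_error(a, b, tau):
    """Truncated absolute difference: deviations larger than tau are capped."""
    return min(abs(a - b), tau)


def masked_mean_error(rendered, observed, mask, tau):
    """Sum the robust pixel-wise error inside the mask, divide by the mask area."""
    total = 0.0
    area = 0
    for r_row, o_row, m_row in zip(rendered, observed, mask):
        for r, o, m in zip(r_row, o_row, m_row):
            if m:
                total += robust_error(r, o, tau)
                area += 1
    return total / area if area > 0 else 0.0


# Tiny 2x2 example with depth values in millimeters. One masked pixel is an
# outlier (900 vs. 100); the pixel outside the mask is ignored entirely.
rendered = [[100.0, 100.0], [100.0, 100.0]]
observed = [[102.0, 900.0], [100.0, 50.0]]
mask = [[True, True], [True, False]]
err = masked_mean_error(rendered, observed, mask, tau=20.0)
```

With the cap in place the outlier contributes at most `tau`, so a single missing or corrupted depth value cannot dominate the score of an otherwise well-fitting pose.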

Secondly, they use the forest predictions for an efficient optimization scheme to 
minimize the energy described above. It consists of two steps. The pixel wise object 
class probabilities are used inside the RANSAC pose estimation. In detail, sets of 
three pixels are sampled depending on the object class probability. For each set a 
pose hypothesis is calculated using the 3D-3D-correspondences between the camera 
coordinates, provided by the depth camera, and the object coordinates predicted by the 
forest. The best hypotheses, according to the energy function, are refined in a final 
step. Refinement is done by repeatedly determining inlier pixels in the rendered mask 
of the object, and again using the correspondences they provide to calculate a better 
pose. Finally, the pose with the lowest energy is taken as the final estimate. 

In our work we build upon the framework of [5]. As in [5] we use the regression- 
classification random forest to obtain the predictions described above. We also use 
their optimization scheme, but replace the energy function with a novel one based on a
trained CNN. The key difference is that, while the energy function in [5] has only a
few parameters, which can be trained via a discriminative cross-validation procedure, the
CNN has around 600K parameters, which we train with a maximum likelihood procedure. We show
that this richness of parameters makes a remarkable difference, and practical challenges
such as occlusion and noise are dealt with much better. This approach will be described
in the next section. 


3 Method 

We will first give a description of the pose estimation task and introduce our terminology.
Then we will describe our probabilistic model. The heart of this model is a CNN,
which will be discussed subsequently. This is followed by a description of our maximum
likelihood training procedure of the probabilistic model. Finally our inference
procedure at test time is described. Fig. 2 gives an overview of our testing pipeline.

3.1 The Pose Estimation Task


We will now formally define the task of 6D pose estimation. Our goal is to estimate
the pose H of a rigid object¹ from a set of observations denoted by x, which will be
discussed later. A pose describes the transformation from the local coordinate system
of the object to the coordinate system of the camera. The local coordinate system
has its origin in the center of the object. Each pose H = (R, T) is a combination
of two components. The rotational component R is a 3 × 3 matrix describing the
rotation around the center of the object. The translational component T is a 3D vector
corresponding to the position of the object center in the camera coordinate system.
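To make the role of the two components concrete, the following minimal sketch maps a point from object coordinates to camera coordinates as R p + T. The function name and the example values are illustrative assumptions, not part of the described system.

```python
def apply_pose(R, T, p):
    """Map point p from object coordinates to camera coordinates: R p + T."""
    return [sum(R[i][j] * p[j] for j in range(3)) + T[i] for i in range(3)]


# Example pose: rotation by 90 degrees about the z axis, object center
# located 1000 mm in front of the camera.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
T = [0.0, 0.0, 1000.0]

center = apply_pose(R, T, [0.0, 0.0, 0.0])   # the object origin maps to T
corner = apply_pose(R, T, [10.0, 0.0, 0.0])  # a point 10 mm along the object x axis
```

Because the local coordinate system has its origin in the object center, the origin is mapped exactly onto T, while any other point is first rotated about the center and then shifted.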

Let us now describe the observation x that is used to estimate the object pose. We 
use RGB-D images as input. However, since we use the same random forest predictions 
as in [5], the term observation or observed images will refer to two parts: (a) the forest 
predictions as described in [5], as well as (b) the recorded depth image. The reason for 
this simplified view is that the focus of our work lies on the modeling of the posterior 
density and aspects of the random forest prediction. 


3.2 Probabilistic Model 


We model the posterior distribution of the pose H given the observations x as a Gibbs
distribution

$$ p(H \mid x; \theta) = \frac{\exp\left(-E(H, x; \theta)\right)}{\int \exp\left(-E(\hat{H}, x; \theta)\right) d\hat{H}}, \tag{1} $$

where E(H, x; θ) is the so-called energy function. The energy function is a mapping
from a pose H and the observed images x to a real number, parametrized by the vector
θ. Note that using a Gibbs distribution to model the posterior is a common practice
for conditional random fields (CRFs) [19]. However, the underlying energies are quite
different. While in a CRF the energy function is a sum of potential functions, we
implement it by using a CNN which directly outputs the energy value. The parameter
vector θ holds the weights of our CNN.
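As a small illustration of how energies translate into posterior probabilities, the sketch below normalizes exp(−E) over a finite set of candidate poses. Replacing the integral over all poses by a sum over a few candidates is an assumption made purely for illustration, and the energy values are invented.

```python
import math


def gibbs_posterior(energies):
    """Normalize exp(-E) over a finite set of candidates (discrete stand-in for Eq. 1)."""
    e_min = min(energies)  # shifting by a constant cancels in the ratio, avoids overflow
    weights = [math.exp(-(e - e_min)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


energies = [0.5, 2.0, 4.0]  # hypothetical energies of three candidate poses
posterior = gibbs_posterior(energies)
best = min(range(len(energies)), key=lambda k: energies[k])
```

The candidate with the lowest energy always receives the highest posterior probability, which is why maximizing the posterior and minimizing the energy select the same pose.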


3.3 Convolutional Neural Network 

In order to implement the mapping from a pose H and the observed images x to an
energy value, we first render the object in pose H to obtain rendered images r(H). Our
CNN then compares x with r(H) and outputs a value f(x, r(H); θ). We define the
energy function as

$$ E(H, x; \theta) = f(x, r(H); \theta). \tag{2} $$

Our network is trained to assign a low energy value when there is a large agreement
between observed images and renderings and a high energy value when there is little
agreement. To perform the comparison we use a simple architecture, in which we feed
all rendered and observed images as separate input channels into the CNN.

¹It should be noted that we assume the object to be present in the field of view, i.e. we do not perform
object recognition.

Figure 2: Our pipeline for the calculation of the energy function. Input and output
are indicated by green arrows. The contents of the dashed box consist of preparatory
steps that have to be computed only once per image. (a) The RGB-D image on which we
base our estimate. The image is processed by a random forest to calculate predictions.
(b) The predicted object probabilities and object coordinates. In the probability image
bright pixels indicate a high probability. In the object coordinate images the 3D object
coordinates are mapped to the RGB cube for visualization. There are multiple object
coordinate images. Each one represents the prediction of one tree. The object coordinates
are combined to a single image [5]. (c) The pose for which we want to calculate the
energy. (d) A 3D model of the object. (e) Images produced by rendering the 3D model
in the input pose. We render an object coordinate image and a depth image. We only
use cutouts around the object. (f) Images of equal size are cut out from the predicted
object probabilities, object coordinates and from the recorded depth image. (g) Finally
the rendered and observed images are processed and fed into the CNN (Sec. 3.3). The
single output of the CNN is our energy function.


Note that we consider only a square window around the center of the object with 
pose H. The width of the window is adjusted according to the size and distance of the 
object, as suggested by [5]. For performance reasons windows which are bigger than 
100x100 pixels are down sampled to this size. We use in total six input channels for 
our network. Note that Fig. 2 shows the images from which these six input channels 
are derived. 

One observed depth channel and one rendered depth channel that contain values in
millimeters. They are normalized by subtracting the z component of the object position
according to H.

One rendered mask channel of the object. Pixel values are either +1 for all pixels
belonging to the object or −1 otherwise.

One depth mask channel indicating whether a depth value was measured in the pixel.
Again, pixel values are either +1 for all pixels where a depth was measured or −1
otherwise.

One probability channel holding the combined pixel-wise object probabilities from all
trees. The values are re-scaled to lie between −1 and +1.

One object coordinate channel holding the pixel-wise Euclidean distances between the
rendered object coordinates and the predicted object coordinate from the tree giving the
highest object probability for the respective pixel. We divide all values by the object
diameter for normalization.
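The per-pixel normalization of the six channels can be sketched as below. The function and argument names are hypothetical, and the linear map 2p − 1 for the probability channel is an assumption: the text only states that the values are re-scaled to lie between −1 and +1.

```python
def sign_channel(flag):
    """Encode a binary mask value as +1 / -1."""
    return 1.0 if flag else -1.0


def make_input_channels(obs_depth, rend_depth, rend_mask, depth_valid,
                        obj_prob, coord_dist, z_center, diameter):
    """Return the six normalized input values for a single pixel."""
    return [
        obs_depth - z_center,       # observed depth relative to the object center (mm)
        rend_depth - z_center,      # rendered depth relative to the object center (mm)
        sign_channel(rend_mask),    # rendered object mask
        sign_channel(depth_valid),  # depth mask: was a depth value measured here?
        2.0 * obj_prob - 1.0,       # object probability, assumed linear map [0, 1] -> [-1, 1]
        coord_dist / diameter,      # object coordinate distance in units of the diameter
    ]


# One pixel of an object whose center lies 1000 mm from the camera.
channels = make_input_channels(obs_depth=1010.0, rend_depth=1000.0,
                               rend_mask=True, depth_valid=True,
                               obj_prob=0.75, coord_dist=20.0,
                               z_center=1000.0, diameter=200.0)
```

Subtracting the z component of the pose and dividing by the diameter make the inputs independent of how far away and how large the object is, which plausibly helps a network trained on one object to be applied to others.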

The tanh activation function is used after every convolution layer and after every
fully connected layer. The first convolution layer C1 consists of 128 convolution kernels
of size 3 × 3 × 6. The second convolution layer C2 consists of 128 kernels of size
3 × 3 × 128, and is followed by a 2 × 2 max-pooling layer with stride 2 in each
direction. The third convolution layer C3 is identical to C2. The fourth convolution
layer consists of 256 kernels of size 3 × 3 × 128. It is followed by a max-pooling
operation over the remaining image size. The 256 channels are further processed by
two fully connected layers with 256 neurons each and finally forwarded to a single
output unit.


3.4 Maximum Likelihood Training 

In training we want to find an optimal set of parameters θ* based on labeled training
data L = (x_1, H_1), ..., (x_n, H_n), where x_i shall denote the observations of the i-th training
image and H_i the corresponding ground truth pose. We apply the maximum likelihood
paradigm and define

$$ \theta^{*} = \arg\max_{\theta} \sum_{i=1}^{n} \ln p(H_i \mid x_i; \theta). \tag{3} $$

In order to solve this optimization task we use stochastic gradient descent [3], which
requires calculating the partial derivatives of the log likelihood for each training sample

$$ \frac{\partial}{\partial \theta_j} \ln p(H_i \mid x_i; \theta) = -\frac{\partial}{\partial \theta_j} E(H_i, x_i; \theta) + \mathbb{E}\left[ \frac{\partial}{\partial \theta_j} E(H, x_i; \theta) \,\middle|\, x_i; \theta \right] \tag{4} $$

with respect to each parameter θ_j. Here E[·|x_i; θ] stands for the conditional expected
value according to the posterior distribution p(H|x_i; θ), parametrized by θ. While the
partial derivatives of the energy function can be calculated by applying back propagation
in our CNN, the expected value cannot be found in closed form. Therefore, we use
the Metropolis algorithm [22] to approximate it, as discussed next.
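The form of this partial derivative follows from the Gibbs distribution in Eq. (1) in a few steps. The derivation below is a sketch that uses only the definitions given so far; Z(x_i; θ) is shorthand introduced here for the normalizing integral, and differentiation under the integral sign is assumed to be permissible.

```latex
\begin{align*}
\ln p(H_i \mid x_i; \theta)
  &= -E(H_i, x_i; \theta) - \ln Z(x_i; \theta),
  \qquad Z(x_i; \theta) = \int \exp\bigl(-E(H, x_i; \theta)\bigr)\, dH, \\
\frac{\partial}{\partial \theta_j} \ln Z(x_i; \theta)
  &= \frac{1}{Z(x_i; \theta)} \int \exp\bigl(-E(H, x_i; \theta)\bigr)
     \left(-\frac{\partial}{\partial \theta_j} E(H, x_i; \theta)\right) dH
   = -\,\mathbb{E}\left[\frac{\partial}{\partial \theta_j} E(H, x_i; \theta)
     \,\middle|\, x_i; \theta\right], \\
\frac{\partial}{\partial \theta_j} \ln p(H_i \mid x_i; \theta)
  &= -\frac{\partial}{\partial \theta_j} E(H_i, x_i; \theta)
     + \mathbb{E}\left[\frac{\partial}{\partial \theta_j} E(H, x_i; \theta)
     \,\middle|\, x_i; \theta\right].
\end{align*}
```

Read this way, a gradient step lowers the energy at the ground truth pose while raising it, on average, at poses the current model considers likely.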

Sampling. It is possible to approximate the expected value in Eq. (4) by a set of pose
samples

$$ \mathbb{E}\left[ \frac{\partial}{\partial \theta_j} E(H, x_i; \theta) \,\middle|\, x_i; \theta \right] \approx \frac{1}{N} \sum_{k=1}^{N} \frac{\partial}{\partial \theta_j} E(H_k, x_i; \theta), \tag{5} $$

where H_1, ..., H_N are pose samples drawn independently from the posterior p(H|x_i; θ)
with the current parameters θ. We use the Metropolis algorithm [22] to generate these
samples. It allows sampling from any distribution with a known density function that
can be evaluated up to a constant factor. The algorithm generates a sequence of samples
H_t by repeating two steps:

1. Draw a new proposed sample H' according to a proposal distribution Q(H'|H_t).

2. Accept or reject the proposed sample according to an acceptance probability
A(H'|H_t). If the proposed sample is accepted set H_{t+1} = H'. If it is rejected
set H_{t+1} = H_t.

The proposal distribution Q(H'|H_t) has to be symmetric, i.e. Q(H'|H_t) = Q(H_t|H').
Our particular proposal distribution will be described in detail in the next section. The
acceptance probability is in our case defined as

$$ A(H' \mid H_t) = \min\left(1, \frac{p(H' \mid x; \theta)}{p(H_t \mid x; \theta)}\right), \tag{6} $$

meaning that whenever the posterior density p(H'|x; θ) at the proposed sample is
greater than the posterior density p(H_t|x; θ) at the current sample, the proposed sample
will automatically be accepted. If this is not the case it will be accepted with the
probability p(H'|x; θ)/p(H_t|x; θ).
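The two steps and the acceptance rule can be sketched on a toy problem. The code below is not the sampler used in this work: it draws scalar samples for a quadratic energy instead of 6D poses, and all names and parameter values are illustrative. It does show why only the energy is needed: for a Gibbs distribution the density ratio equals exp(E(H_t) − E(H')), so the normalizing constant cancels.

```python
import math
import random


def metropolis(energy, propose, h0, n_iter, burn_in, rng):
    """Sample from p(h) proportional to exp(-energy(h)) using a symmetric proposal."""
    h, e = h0, energy(h0)
    samples = []
    for t in range(n_iter):
        h_new = propose(h, rng)
        e_new = energy(h_new)
        # Accept with probability min(1, p(h_new) / p(h)) = min(1, exp(e - e_new)).
        if e_new <= e or rng.random() < math.exp(e - e_new):
            h, e = h_new, e_new
        if t >= burn_in:  # discard the burn-in phase
            samples.append(h)
    return samples


# Toy target: quadratic energy, so the target is a standard normal distribution.
rng = random.Random(0)
samples = metropolis(
    energy=lambda h: 0.5 * h * h,
    propose=lambda h, r: h + r.gauss(0.0, 1.0),  # symmetric Gaussian random walk
    h0=3.0, n_iter=20000, burn_in=1000, rng=rng)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

On this toy target the sample mean and variance should come out close to 0 and 1. Rejected proposals repeat the current sample in the sequence, which is what makes the chain spend more time in regions of low energy.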

Proposal Distribution. A common choice for the proposal distribution is a normal
distribution centered at the current sample. In our case this is not possible because
the rotational component of the pose lives on the manifold SO(3), i.e. the group of
rotations. We define Q(H'|H_t) implicitly by describing a sampling procedure and
ensuring that it is symmetric. The translational component T' of the proposed sample
is directly drawn from a 3D isotropic normal distribution N(T_t, Σ_T) centered at the
translational component T_t of the current sample H_t. The rotational component R' of
the proposed sample H' is generated by applying a random rotation R to the rotational
component R_t of the current sample: R' = R R_t.

We calculate R as the rotation matrix corresponding to an Euler vector² e, which
is drawn from a 3D zero-centered isotropic normal distribution e ~ N(0, Σ_R).
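The sampling procedure for a proposed pose can be sketched as follows. Converting an Euler vector (axis times angle) into a rotation matrix is done here with Rodrigues' formula; the scalar standard deviations `sigma_t` and `sigma_r`, which stand in for the isotropic covariances, and their values are assumptions for illustration.

```python
import math
import random


def euler_vector_to_matrix(e):
    """Rodrigues' formula: rotation about the axis e/|e| by the angle |e| (radians)."""
    angle = math.sqrt(sum(v * v for v in e))
    if angle < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = (v / angle for v in e)
    c, s = math.cos(angle), math.sin(angle)
    k = 1.0 - c
    return [[c + x * x * k, x * y * k - z * s, x * z * k + y * s],
            [y * x * k + z * s, c + y * y * k, y * z * k - x * s],
            [z * x * k - y * s, z * y * k + x * s, c + z * z * k]]


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][m] * b[m][j] for m in range(3)) for j in range(3)]
            for i in range(3)]


def propose_pose(R_t, T_t, sigma_t, sigma_r, rng):
    """Draw a proposed pose (R', T') around the current pose (R_t, T_t)."""
    T_new = [t + rng.gauss(0.0, sigma_t) for t in T_t]  # isotropic normal around T_t
    e = [rng.gauss(0.0, sigma_r) for _ in range(3)]     # zero-centered Euler vector
    R_new = matmul3(euler_vector_to_matrix(e), R_t)     # R' = R R_t
    return R_new, T_new


rng = random.Random(0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R_new, T_new = propose_pose(identity, [0.0, 0.0, 1000.0],
                            sigma_t=5.0, sigma_r=0.1, rng=rng)
quarter_turn = euler_vector_to_matrix([0.0, 0.0, math.pi / 2.0])
```

Since e and −e are equally likely under a zero-centered normal distribution, the inverse rotation is proposed with the same density, which is what the symmetry requirement asks for.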

Initialization and Burn-in Phase. When the Metropolis algorithm is initialized in
an area with low density, it requires more iterations to provide a fair approximation
of the expected value. To find a good initialization we run our inference procedure
(described in the next section) using the current parameter set. We then perform the
Metropolis algorithm for a total of 130 iterations, disregarding the samples from the
first 30 iterations, which are considered the burn-in phase.

3.5 Inference Procedure 

During test time we aim at finding the MAP estimate, i.e. the pose maximizing our
posterior density as given in Eq. (1). Since the denominator in Eq. (1) is constant for
any given observation x, finding the MAP estimate is equivalent to minimizing our
energy function. To achieve this, we utilize the optimization scheme from [5], but
replace their energy function with ours.

²The 3D Euler vector represents a rotation. The direction of the vector describes the axis
of the rotation and the length corresponds to the angle.


4 Experiments 

In the following we compare our approach to the state-of-the-art method of Brachmann
et al. [5] on two different datasets. We first describe some implementation details of
the competitor and introduce the datasets. After that we describe details of our training 
procedure, and finally present quantitative and qualitative comparison. We will see that 
we achieve considerable improvements for both datasets. Additionally, we observe that 
our CNN generalizes from a single training object to a set of 11 test objects, with large 
variability in appearance and geometry. 

4.1 Datasets, Competitors, Evaluation Protocol 

Datasets. We use two datasets featuring heavy occlusion. The first dataset was created
by Brachmann et al. [5] by annotating the ground truth poses for eight partially
occluded objects in images taken from the dataset of Hinterstoisser et al. [14]. We
will refer to this dataset as the occlusion dataset from [5] and [14]. It includes a total
of 8992 test cases (images with different annotation), which are used for testing. We
choose this dataset because it is more challenging than the original dataset from [14],
on which [5] already achieves an average of 98.3% correctly estimated poses.

The second dataset was introduced by Krull et al. [17]. It provides six annotated
RGB-D sequences of three different objects and consists of a total of 3187 images. We
use three of the sequences for training and the other three (a total of 1715 test images)
for testing.

Evaluation Protocol. We use the evaluation procedure as described in [5]. This means 
we calculate the percentage of correctly predicted poses for each sequence. As in [14] 
we calculate the average distance between the 3D model vertices under the estimated 
pose and under the ground truth pose. A pose is considered correct, when the average 
distance is below 10% of the object diameter. 
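The criterion can be written down directly. The sketch below assumes poses are given as (R, T) pairs with R as a nested 3 × 3 list and that the object diameter is supplied by the caller; all function names are illustrative.

```python
import math


def transform(R, T, p):
    """Apply the pose (R, T) to a model vertex p."""
    return [sum(R[i][j] * p[j] for j in range(3)) + T[i] for i in range(3)]


def average_distance(vertices, pose_est, pose_gt):
    """Mean distance between vertices under the estimated and the ground truth pose."""
    total = 0.0
    for v in vertices:
        total += math.dist(transform(pose_est[0], pose_est[1], v),
                           transform(pose_gt[0], pose_gt[1], v))
    return total / len(vertices)


def is_correct(vertices, pose_est, pose_gt, diameter, threshold=0.1):
    """A pose counts as correct if the average distance is below 10% of the diameter."""
    return average_distance(vertices, pose_est, pose_gt) < threshold * diameter


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
vertices = [[0.0, 0.0, 0.0], [100.0, 0.0, 0.0], [0.0, 100.0, 0.0]]
pose_gt = (identity, [0.0, 0.0, 1000.0])
pose_near = (identity, [5.0, 0.0, 1000.0])   # shifted by 5 mm
pose_far = (identity, [15.0, 0.0, 1000.0])   # shifted by 15 mm
near_ok = is_correct(vertices, pose_near, pose_gt, diameter=100.0)
far_ok = is_correct(vertices, pose_far, pose_gt, diameter=100.0)
```

For a pure translation every vertex moves by the same amount, so with an assumed diameter of 100 mm the 5 mm shift passes the 10 mm threshold and the 15 mm shift does not.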

Competitors. We compare our method to the one presented in [5]. For doing so we
needed to re-implement this method³. We observed that our re-implementation gives
on average slightly superior results. In the following, we mostly report two numbers, 
those of our re-implementation and those of the method of [5], reported in [5] or [17]. 
For completeness we additionally provide the numbers from LineMOD [14] as reported 
in [5]. 

4.2 Training Procedure 

Random Forests. We used different random forests for training and testing on both 
datasets. The forests were kindly provided to us by the authors of [5]. 

³Our re-implementation is identical up to small details, which we discussed with the authors of [5].

Figure 3 (panels: Training; Validation; Two of eleven test objects): Images from one of
our training-testing configurations: the Samurai 1 sequence is used for training, the
Cat 1 sequence for validation. Sequences of all objects are used for testing. Note that
the objects are of vastly different shape and appearance.


CNN. We trained three CNNs, each time using only a single object from the dataset
provided by Krull et al. [17]. The sequences Toolbox 1, Cat 1, and Samurai 1
served as training sets (see Fig. 3). The first 100 frames from Samurai 1 were
removed in order to obtain a high percentage of frames with occlusion. Our validation
set consists of 100 randomly selected frames from the Cat 1 sequence, or the
Samurai 1 sequence (in the case where Cat 1 was used as training set). The weights of the
CNN were randomly initialized. Before training, the random weights of the last layer
were multiplied by a factor of 1000, in order to cover a greater range of possible energy
values. After every 5th iteration of stochastic gradient descent, we perform inference
on the validation set and adjust the learning rate. The learning rate at step t was
proportional to γ_t = γ_0/(1 + γ_0 λ t) [4], with γ_0 = 10 and λ = 0.5. After training we
pick the set of weights which achieved the highest percentage of correctly estimated
poses on the validation set. We use the criterion from [14] to classify a pose as correct.
One training cycle consisting of five steps of stochastic gradient descent and validation
took⁴ 9 min 46 sec (2 min 27 sec + 7 min 19 sec). Further details on our training procedure
can be found in the supplementary material.
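The learning rate schedule is easy to tabulate. The sketch below evaluates the decay term for the first few steps; since the text states only that the learning rate is proportional to this term, the constant of proportionality is left out, and the function name is illustrative.

```python
def learning_rate_factor(t, gamma0=10.0, lam=0.5):
    """Decay term gamma_t = gamma_0 / (1 + gamma_0 * lambda * t)."""
    return gamma0 / (1.0 + gamma0 * lam * t)


# Values for steps t = 0, 1, 2, 3: 10, 10/6, 10/11, 10/16.
factors = [learning_rate_factor(t) for t in range(4)]
```

The term starts at γ_0 and, once γ_0 λ t is much larger than 1, behaves like 1/(λ t), so the initial value mainly influences the first few steps.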

4.3 Comparison 

Occlusion Dataset from [5] and [14]. Quantitative results for this dataset are shown 
in Fig. 4, for all individual test and training objects. Considering the average over all 
objects we achieve an improvement of up to 9.23% compared to our re-implementation 
of [5] and 10.4% compared to the reported values in [5]. Some qualitative results are 
illustrated in Fig. 7. In Fig. 5 we show another comparison of our method with respect 
to [5]. It illustrates that we achieve the biggest gain for occlusion percentage between 
50% and 60%. 

Dataset of Krull et al. For this dataset we observe similar results as with the previous
dataset. Since the other sequences were used in training and validation, we evaluated
only with the Toolbox 2, Cat 2, and Samurai 2 sequences. When averaged over all
objects we achieve an improvement of 10.97% compared to the results of [5]. The

⁴We used an Intel(R) Core(TM) i7-3820 CPU at 3.60 GHz with a GeForce GTX 660 GPU. The Cat 1
sequence was used for training and 100 random frames from Samurai 1 for validation.

Figure 4: Quantitative comparison of our method against the results of [5] and
LineMOD [14] on the occlusion dataset from [5] and [14]. Circles, squares, and
triangles indicate the individual performance of CNNs trained with Tool Box, Cat, and
Samurai respectively. The green bars indicate the average result. Averaged over all
test and training objects we obtain the correct pose in 72.98% of cases, in contrast to
63.24% for [5] and 48.84% for LineMOD [14]. A table with the detailed numbers
can be found in the supplementary material.


quantitative results can be found in Fig. 6, and a few qualitative results are shown in 
Fig. 8. 

Discussion of Failure Cases. The failure cases which are framed red in Fig. 7 have 
to be considered as failure of our learned energy function. However, the failure cases 
framed orange still exhibit a lower energy at the ground truth pose than at the estimate. 
This indicates a failure of the optimization scheme. It should be investigated in which
cases the correct pose can be found using an alternative optimization scheme. In the
dataset introduced by Krull et al., our accuracy for the Tool Box sequences is below
that of our competitor (see Fig. 6). We attribute this to the fact that the Tool Box is the
biggest object and most strongly affected by the downsampling scheme described in
Sec. 3.3.

5 Conclusion 

We have presented a model for the posterior distribution in 6D pose estimation, which 
uses a CNN to map rendered and observed images to an energy value. We train the 
CNN based on the maximum likelihood paradigm. It has been demonstrated that training
on a single object is sufficient and that the CNN is able to generalize to different objects
and backgrounds. Our system has been evaluated on two datasets featuring heavy occlusion.
By using our energy as objective function for pose estimation, we were able to
achieve considerable improvements compared to the best previously published results. 

Our approach is restricted neither to the feature channels nor to the application we
demonstrated. The architecture of the CNN can in principle be applied to any kind of

Figure 5: The percentage of correctly estimated poses for all test cases of the occlusion
dataset from [5] and [14], as a function of the level of occlusion. For this we divided 
the test cases into bins according to the amount of occlusion, using a bin width of 10%. 
(See details of this procedure in the supplementary material.) We compare our method 
(using the CNN trained with the Samurai object) to our re-implementation of [5]. We 
achieve improvements of over 20% for occlusion levels between 50% and 60%. 



Figure 6: Comparison of our method on the dataset of Krull et al. against the results
of [5]. Circles, Squares, and Triangles indicate the individual performance of CNNs 
trained with Tool Box, Cat, and Samurai respectively. The green bars indicate the 
average result. We report 56.02%, 59.56%, and 54.65% correctly estimated poses for 
Tool Box, Cat, and Samurai respectively. Averaged over all test and training objects we 
achieve 56.74%. 


observed and rendered image. We think it would be worth investigating if the approach 
could be applied to other scenarios. An example could be pose estimation from pure 
RGB without recorded depth image and a forest to calculate features. Pose estimation 
for object classes could also benefit from our approach. Considering the recent success 
of CNNs in recognition [2, 23] it might be possible for a CNN to learn to compare observed
images to renderings of an idealized model representing an object class instead
of an instance. Our approach is not limited to comparing images of the same kind, as

Figure 7: Qualitative results of our method on the occlusion dataset from [5] and
[14]. Here green and blue silhouettes correspond to the ground truth and our estimate, 
respectively. The test images depicted with a green frame show correct estimates. 
Images with orange and red frame show incorrect estimates. The image with an orange 
frame shows a case where the energy of the ground truth pose, according to Eq. (2), is 
lower than the energy of the estimated pose. In this case a better pose may be found 
with an improved optimization scheme. 


for example rendered and observed depth images. Instead, it could learn to assess the
plausibility of the shading in an observed RGB by comparing it to a rendered depth 
image, which can be more easily produced than a realistic RGB rendering. 

An interesting future line of research could be to train a CNN to predict pose updates
from observed and rendered images. This could replace the refinement step and
might improve the results.

Figure 8: Qualitative results of our method on the test cases from the dataset introduced
in [17]: Green frames correspond to correctly estimated poses according to the
criteria from [14]. Orange frames correspond to incorrectly estimated poses with a 
lower energy at the ground truth than at the estimated pose. 